-
Notifications
You must be signed in to change notification settings - Fork 247
feat: improve dataset #1893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improve dataset #1893
Conversation
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
📝 WalkthroughWalkthroughThis PR introduces support for a Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/data/datasets/utils.py (1)
86-104:⚠️ Potential issue | 🟡 Minor
data_subsetis silently ignored in theload_from_diskfallback.If a user provides a
data_subsetwith a path that triggers theload_from_diskfallback (line 102), the subset is silently discarded. Consider adding an assertion (matching the pattern on line 88) to warn the user.Proposed fix
except ValueError as e: # load from local file (save_to_disk format) if "load_from_disk" in str(e): + assert data_subset is None, ( + "data_subset is only supported for huggingface datasets" + ) raw_dataset = load_from_disk(data_path) else: raise e
🤖 Fix all issues with AI agents
In `@docs/guides/dpo.md`:
- Line 39: The link text "response_datasets/tulu3.py" in docs/guides/dpo.md is
misleading because it points to
../../nemo_rl/data/datasets/preference_datasets/tulu3.py; update the doc so link
text matches the actual target (e.g., change the link text to
"preference_datasets/tulu3.py") or adjust the URL to point to the intended
response_datasets path, ensuring the visible text and link destination are
consistent.
In `@docs/guides/rm.md`:
- Line 28: The link display text is inconsistent: the markdown shows
"response_datasets/tulu3.py" while the URL points to
"../../nemo_rl/data/datasets/preference_datasets/tulu3.py"; update the markdown
in rm.md so the link text matches the actual target (e.g., change the display
text to "preference_datasets/tulu3.py") or alternatively change the URL if you
intended to link to response_datasets; edit the specific link near the sentence
"An example implementation can be found in [response_datasets/tulu3.py]" to keep
display text and path consistent.
In `@nemo_rl/data/__init__.py`:
- Line 23: Change the type of the optional subset field to allow explicit nulls:
update the NotRequired annotation for subset in DatasetConfig (and the subset
field in PreferenceDatasetConfig) from NotRequired[str] to NotRequired[str |
None] (or NotRequired[Optional[str]]), ensuring any necessary typing imports are
present so YAML null maps to None without violating the type contract.
🧹 Nitpick comments (1)
nemo_rl/data/datasets/utils.py (1)
94-98: Use keyword argument forload_datasetsubset parameter.
load_dataset(data_path, data_subset)passes the subset as a positional argument, relying onnamebeing the second parameter. Use the explicit keyword argument for clarity and resilience to API changes.Proposed change
# load from huggingface if data_subset: - raw_dataset = load_dataset(data_path, data_subset) + raw_dataset = load_dataset(data_path, name=data_subset) else: raw_dataset = load_dataset(data_path)
Signed-off-by: Yuki Huang <yukih@nvidia.com>
gsm8kcan directly useResponseDatasetto load.Summary by CodeRabbit
Release Notes
New Features
Documentation
task_namefield structure and dataset formatting conventions.Tests